
Feat/ci tests v2 framework cpu hog #1180

Open
ddjain wants to merge 20 commits into krkn-chaos:main from ddjain:feat/ci-tests-v2-framework-cpu-hog

Conversation

@ddjain
Collaborator

@ddjain ddjain commented Mar 6, 2026

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Scenario

  • CPU hog (hog_scenarios): runs a CPU stress workload on selected nodes for a set duration, then removes the hog pods.

Test cases

  1. Success and lifecycle – Scenario runs, at least one hog pod appears during the run, process exits 0, no hog pods left after run.
  2. Node selector and duration – Scenario runs with node-selector=kubernetes.io/os=linux and duration=10; exit 0 and run time in expected range (~8–90s).
  3. Invalid node selector – Node selector matches no nodes; Krkn exits with failure (non-zero).
  4. Invalid scenario YAML – Invalid scenario file; Krkn exits with failure (non-zero).
  Tests use ephemeral namespaces (no pre-deployed workload).
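The lifecycle checks above hinge on polling the ephemeral namespace for hog pods. A minimal sketch of such a poller, written against a generic zero-argument lister callable rather than the PR's actual `_wait_for_hog_pod` helper (the function name and injection style here are illustrative assumptions):

```python
import time

def wait_for_pods_with_prefix(list_pod_names, prefix, timeout=30.0, interval=1.0):
    """Poll until at least one pod name starts with `prefix`.

    list_pod_names: zero-arg callable returning an iterable of pod names,
    e.g. a thin wrapper around CoreV1Api.list_namespaced_pod(namespace).
    Returns the matching names, or raises TimeoutError at the deadline.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        matches = [name for name in list_pod_names() if name.startswith(prefix)]
        if matches:
            return matches
        time.sleep(interval)
    raise TimeoutError(f"no pod with prefix {prefix!r} within {timeout}s")
```

Injecting the lister keeps the sketch testable without a cluster; the real test would bind it to the suite's Kubernetes client and namespace.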

Related Tickets & Documents

If there is no related issue, please create one and start the conversation there.

  • Related Issue #:
  • Closes #:

ddjain and others added 20 commits February 24, 2026 16:07
…isolation

Signed-off-by: ddjain <darjain@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@ddjain ddjain marked this pull request as ready for review March 9, 2026 10:49
@qodo-code-review

Review Summary by Qodo

Add CPU hog scenario functional tests to v2 framework

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• Add CPU hog scenario functional tests with ephemeral namespace isolation
• Implement test cases for success lifecycle, node selector, duration validation
• Add failure scenarios for invalid node selector and malformed YAML
• Create base scenario configuration for CPU hog stress testing
Diagram
flowchart LR
  A["CPU Hog Test Suite"] --> B["Success & Lifecycle Test"]
  A --> C["Node Selector & Duration Test"]
  A --> D["Invalid Node Selector Test"]
  A --> E["Invalid Scenario YAML Test"]
  B --> F["Verify Pod Creation & Cleanup"]
  C --> G["Validate Timing & Exit Code"]
  D --> H["Assert Kraken Failure"]
  E --> H
  A --> I["Base Scenario Config"]


File Changes

1. CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py 🧪 Tests +121/-0

CPU hog scenario functional test suite implementation

• Implement four functional test cases for CPU hog scenario execution
• Test pod lifecycle: creation during run and cleanup after completion
• Validate node selector targeting and duration constraints
• Verify graceful failure handling for invalid configurations
• Add helper functions to poll and retrieve hog pods from namespace

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py


2. CI/tests_v2/pytest.ini ⚙️ Configuration changes +1/-0

Add cpu_hog pytest marker configuration

• Add cpu_hog pytest marker for CPU hog scenario test identification
• Enable selective test execution and filtering by scenario type

CI/tests_v2/pytest.ini


3. CI/tests_v2/scenarios/cpu_hog/scenario_base.yaml ⚙️ Configuration changes +12/-0

Base CPU hog scenario configuration template

• Define base CPU hog scenario configuration with default parameters
• Set duration, worker count, CPU load percentage, and image reference
• Configure namespace and node selector targeting for test execution
• Specify hog type as CPU with all CPU methods and single node target

CI/tests_v2/scenarios/cpu_hog/scenario_base.yaml
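For orientation, an illustrative shape of such a base config (field names and the image reference are assumptions inferred from the bullet points above, not copied from the PR's actual scenario_base.yaml):

```yaml
# Illustrative only -- field names inferred from the description above,
# not the PR's actual scenario_base.yaml.
duration: 10                  # seconds of CPU stress
workers: 1                    # stressor workers per hog pod
cpu-load-percentage: 80       # target CPU load
cpu-method: all               # exercise all CPU stress methods
hog-type: cpu
image: quay.io/example/hog    # hypothetical image reference
namespace: ""                 # filled per-test with the ephemeral namespace
node-selector: "kubernetes.io/os=linux"
number-of-nodes: 1            # single node target
```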



@qodo-code-review

qodo-code-review bot commented Mar 9, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Unhandled kraken timeout 🐞 Bug ⛯ Reliability
Description
test_cpu_hog_success_and_lifecycle calls proc.communicate(timeout=90) without handling subprocess
timeout, so a slow/hung kraken run will raise and leave the background process running (with
stdout/stderr pipes) and can hang the overall test session.
Code

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[R56-64]

+        proc = self.run_kraken_background(config_path)
+        try:
+            pods = _wait_for_hog_pod(
+                self.k8s_core, ns, self.HOG_POD_PREFIX, timeout=POLICY_WAIT_TIMEOUT
+            )
+            assert len(pods) >= 1, f"Expected at least one hog pod in namespace={ns}"
+        finally:
+            # duration=10 + pod wait (30s) + cleanup; allow 90s for Krkn to exit.
+            stdout, stderr = proc.communicate(timeout=90)
Evidence
The test starts kraken using a Popen with stdout/stderr pipes and then unconditionally calls
communicate(timeout=90) in a finally block without any termination/kill fallback, so a
TimeoutExpired will abort cleanup and can leave the subprocess alive. The suite already defines
overridable timeout constants (including KRAKEN_PROC_WAIT_TIMEOUT), but this test hard-codes 90
seconds, making behavior inconsistent when env overrides are used.

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[45-65]
CI/tests_v2/lib/kraken.py[44-58]
CI/tests_v2/lib/base.py[38-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`test_cpu_hog_success_and_lifecycle` starts Kraken in the background and calls `proc.communicate(timeout=90)` without handling `subprocess.TimeoutExpired`. If Kraken hangs or runs longer than expected, the test errors out and can leave the subprocess running (with stdout/stderr pipes), which may stall the test session and leak cluster resources.

### Issue Context
- `run_kraken_background` uses `stdout=PIPE` and `stderr=PIPE`.
- The test suite already defines configurable timeout constants (`KRAKEN_PROC_WAIT_TIMEOUT`, `TIMEOUT_BUDGET`, etc.) via env vars.

### Fix Focus Areas
- CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[56-69]
- CI/tests_v2/lib/base.py[38-46]

### Suggested implementation notes
- Wrap `proc.communicate(...)` in `try/except subprocess.TimeoutExpired`.
- On timeout: `proc.terminate()` then `proc.kill()` if still running; drain output; then fail with a clear message.
- Replace the hard-coded `90` with `KRAKEN_PROC_WAIT_TIMEOUT` or a computed timeout derived from `scenario['duration']` plus a buffer.




Remediation recommended

2. Logs dropped on pod-wait 🐞 Bug ✧ Quality
Description
If no hog pod appears and _wait_for_hog_pod raises TimeoutError, the test never reaches the code
that wraps stdout/stderr and calls assert_kraken_success, so the kraken output captured in the
finally block is not persisted or shown, making CI failures hard to debug.
Code

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[R56-70]

+        proc = self.run_kraken_background(config_path)
+        try:
+            pods = _wait_for_hog_pod(
+                self.k8s_core, ns, self.HOG_POD_PREFIX, timeout=POLICY_WAIT_TIMEOUT
+            )
+            assert len(pods) >= 1, f"Expected at least one hog pod in namespace={ns}"
+        finally:
+            # duration=10 + pod wait (30s) + cleanup; allow 90s for Krkn to exit.
+            stdout, stderr = proc.communicate(timeout=90)
+        result = SimpleNamespace(
+            returncode=proc.returncode,
+            stdout=stdout or "",
+            stderr=stderr or "",
+        )
+        assert_kraken_success(result, context=f"namespace={ns}", tmp_path=self.tmp_path)
Evidence
The test waits for hog pods inside a try and then builds the result object and calls
assert_kraken_success only after the try/finally. Any exception from _wait_for_hog_pod exits the
test before the result/assertion path, and unlike assert_kraken_success (which writes logs to
tmp_path on failure), there is no log persistence in this exception path.

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[56-70]
CI/tests_v2/lib/utils.py[166-188]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When `_wait_for_hog_pod(...)` times out, the test raises `TimeoutError` before it creates the `result` object and calls `assert_kraken_success`. Although `stdout/stderr` are obtained in the `finally` block, they are not persisted or surfaced on this exception path.

### Issue Context
`assert_kraken_success`/`assert_kraken_failure` already have a convention of writing `kraken_stdout.log` / `kraken_stderr.log` to `tmp_path`, but this path bypasses those helpers.

### Fix Focus Areas
- CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[56-75]
- CI/tests_v2/lib/utils.py[166-188]

### Suggested implementation notes
- Capture exceptions from the pod-wait block (e.g., `except Exception as e: exc = e`) and in `finally` write `stdout/stderr` to `tmp_path` before re-raising/failing.
- Alternatively, convert the timeout into an `AssertionError` that includes the last N lines of kraken stdout/stderr and points to the log files.




